Google's Deep Web crawl

نویسندگان

  • Jayant Madhavan
  • David Ko
  • Lucja Kot
  • Vignesh Ganapathy
  • Alex Rasmussen
  • Alon Y. Halevy
چکیده

The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index. The results of our surfacing have been incorporated into the Google search engine and today drive more than a thousand queries per second to Deep-Web content. Surfacing the Deep Web poses several challenges. First, our goal is to index the content behind many millions of HTML forms that span many languages and hundreds of domains. This necessitates an approach that is completely automatic, highly scalable, and very efficient. Second, a large number of forms have text inputs and require valid inputs values to be submitted. We present an algorithm for selecting input values for text search inputs that accept keywords and an algorithm for identifying inputs which accept only values of a specific type. Third, HTML forms often have more than one input and hence a naive strategy of enumerating the entire Cartesian product of all possible inputs can result in a very large number of URLs being generated. We present an algorithm that efficiently navigates the search space of possible input combinations to identify only those that generate URLs suitable for inclusion into our web search index. We present an extensive experimental evaluation validating the effectiveness of our algorithms.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

DHT-Based Distributed Crawler

A search engine, like Google, is built using two pieces of infrastructure a crawler that indexes the web and a searcher that uses the index to answer user queries. While Google's crawler has worked well, there is the issue of timeliness and the lack of control given to end-users to direct the crawl according to their interests. The interface presented by such search engines is hence very limite...

متن کامل

A Personalization Recommendation Method Based on Deep Web Data Query

Deep Web is becoming a hot research topic in the area of database. Most of the existing researches mainly focus on Deep Web data integration technology. Deep Web data integration can partly satisfy people's needs of Deep Web information search, but it cannot learn users’ interest, and people search the same content online repeatedly would cause much unnecessary waste. According to this kind of ...

متن کامل

Towards Crawling the Web for Structured Data: Pitfalls of Common Crawl for E-Commerce

In the recent years, the publication of structured data inside HTML content of Web sites has become a mainstream feature of commercial Web sites. In particular, e-commerce sites have started to add RDFa or Microdata markup based on schema.org and GoodRelations vocabularies. For many potential usages of this huge body of data, we need to crawl the sites and extract the data from the markup. Unfo...

متن کامل

LanguageCrawl: A Generic Tool for Building Language Models Upon Common-Crawl

The web data contains immense amount of data, hundreds of billion words are waiting to be extracted and used for language research. In this work we introduce our tool LanguageCrawl which allows Natural Language Processing (NLP) researchers to easily construct web-scale corpus the from Common Crawl Archive: a petabyte scale open repository of web crawl information. Three use-cases are presented:...

متن کامل

Deeper: A Data Enrichment System Powered by Deep Web

Data scientists often spend more than 80% of their time on data preparation. Data enrichment, the act of extending a local database with new attributes from external data sources, is among the most time-consuming tasks. Existing data enrichment works are resource intensive: data-intensive by relying on web tables or knowledge bases, monetarily-intensive by purchasing entire datasets, or timeint...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • PVLDB

دوره 1  شماره 

صفحات  -

تاریخ انتشار 2008